Abstract 

Characteristic curve approaches to linking parameters from the generalized partial credit model 
(GPCM) are examined for cases where common (anchor) items are calibrated separately in two 
groups. Three of these approaches are simple extensions of the test characteristic curve (TCC), 
item characteristic curve (ICC), and operating characteristic curve (OCC) methods that have been 
previously developed for other binary item response models. The ICC approach explicitly 
provides a symmetric solution for estimating linking constants whereas the TCC and OCC 
approaches yield an asymmetric solution. Thus, the symmetry of the result is confounded with the 
type of characteristic curve used to derive the result. New characteristic curve techniques are 
developed to eliminate this confound. Specifically, symmetric versions of the TCC and OCC 
methods are developed within the context of the GPCM along with an asymmetric version of the 
ICC technique. The accuracy of linking constant estimates and the accuracy of rescaled GPCM 
parameter estimates obtained with each method are examined in a simulation study. The study 
suggests that the TCC method yields slightly more accurate estimates of linking constants and 
model parameters, as do symmetric, as opposed to asymmetric, solutions. It also suggests that 
all of the methods yield similar linking results when GPCM parameters are estimated accurately 
using large samples. 
Exploring Alternative Characteristic Curve Approaches to Linking Parameter Estimates from the Generalized Partial Credit Model 

Parametric item response theory (IRT) models generally yield parameter estimates that are 
identifiable up to some arbitrary change in location (i.e., origin), and perhaps an arbitrary change 
in scale (i.e., unit) when the model contains a discrimination parameter. Constraints must be 
introduced to achieve a unique solution when estimating parameters in these models. These 
constraints, in turn, yield parameter estimates that have a data dependent metric. To overcome 
this dependency, parameter estimates are typically transformed to achieve a common metric 
before comparing estimates from different calibrations. We refer to this transformation process as 
“linking” parameter estimates. 

There have been several methods proposed to link parameter estimates using common 
(anchor) items from alternative calibrations of a given parametric IRT model (Kolen & Brennan, 
1995). Linking methods based on characteristic curve approaches are generally thought to be 
reasonably accurate and more robust to outliers as compared to methods based simply on 
summary measures of parameter estimate distributions (Baker & Al-Karni, 1991; Stocking & 
Lord, 1983). Moreover, characteristic curve methods only require access to the IRT parameter 
estimates themselves, and not to the estimated item parameter covariances or to the raw item 
response data. Therefore, this paper will focus on characteristic curve approaches to linking IRT 
parameter estimates. 

To date, there have been three distinct characteristic curve approaches developed in the IRT 
literature. These are the test characteristic curve (TCC) method (Stocking & Lord, 1983), the 
item characteristic curve (ICC) method (Haebara, 1980), and the operating characteristic curve 
(OCC) method (Baker, 1993). The first two of these methods were proposed in the context of binary 
logistic models whereas the last was developed in the context of the nominal response model 
(Bock, 1972). This paper will focus on extending each of these methods to Muraki’s (1992) 
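generalized partial credit model (GPCM). The GPCM is defined in the gth calibration group as 

$$P_{(g)ik}\left(\theta_{(g)j}, a_{(g)i}, b_{(g)i}, \mathbf{d}_{(g)i}\right) = \frac{\exp\left[\sum_{v=1}^{k} a_{(g)i}\left(\theta_{(g)j} - b_{(g)i} + d_{(g)iv}\right)\right]}{\sum_{c=1}^{K_i} \exp\left[\sum_{v=1}^{c} a_{(g)i}\left(\theta_{(g)j} - b_{(g)i} + d_{(g)iv}\right)\right]} \qquad (1)$$
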
where: 

$P_{(g)ik}$ is the operating characteristic function for the kth category of the ith item in the gth 
calibration group, 

$\theta_{(g)j}$ is the location of the jth individual from the gth calibration group on the latent continuum, 

$a_{(g)i}$ is the discrimination parameter for the ith item from the gth calibration group, 

$b_{(g)i}$ is the location of the ith item from the gth calibration group, 

$\mathbf{d}_{(g)i}$ is the vector of $K_i$ threshold parameters for the ith item in the gth calibration group, which 
are constrained to sum to zero, and 

$K_i$ is the number of response categories for the ith item. 
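
As a concrete illustration of the model, the following minimal Python sketch computes GPCM category probabilities for a single item. It assumes Muraki's (1992) parameterization with no additional scaling constant; the function name `gpcm_probs` and the parameter values are hypothetical and chosen only for illustration.

```python
import math

def gpcm_probs(theta, a, b, d):
    """Category probabilities for one GPCM item (assumed Muraki form).

    d holds the K_i thresholds (assumed to sum to zero); the result has
    one probability per response category, ordered k = 1..K_i.
    """
    # Cumulative sums of a * (theta - b + d_v) for v = 1..k.
    z, total = [], 0.0
    for d_v in d:
        total += a * (theta - b + d_v)
        z.append(total)
    # Subtract the maximum before exponentiating for numerical stability.
    z_max = max(z)
    num = [math.exp(v - z_max) for v in z]
    denom = sum(num)
    return [n / denom for n in num]

# Hypothetical three-category item.
probs = gpcm_probs(theta=0.3, a=1.2, b=0.0, d=[0.0, 0.8, -0.8])
```

The probabilities sum to one across the $K_i$ categories, and an individual located exactly at the item location is equally likely to choose the lowest and highest categories when the thresholds sum to zero.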

The GPCM is interesting in that all three characteristic curve approaches can be used to link 
parameter estimates that are derived from it. However, there has not been a systematic 
comparison of these alternative characteristic curve linking methods. This paper will, among 
other things, report on such a comparison. 

The TCC, ICC, and OCC linking methods differ from each other with regard to the symmetric or 
asymmetric nature of their solutions. All of these methods attempt to derive linking constants that 
are used to transform parameter estimates from one calibration group (i.e., the transformed 
group) to the metric of estimates from the other group (i.e., the target group). The TCC and 
OCC methods yield linking constants that are dependent on which set of estimates constitutes the 
target metric. If these methods are used to obtain two solutions in which the target calibration 
group is reversed, then the two results will not be simple inverses of each other. We refer to these 
types of solutions as asymmetric. In contrast, the ICC method yields a symmetric solution. If the 
ICC method is applied twice such that the roles of the target and transformation groups are 
reversed across the two applications, then the resulting solutions will be simple inverses of each 
other. As mentioned above, the past literature on characteristic curve approaches to linking IRT 
parameter estimates has confounded the notion of solution symmetry with the type of 
characteristic curve used to derive linking constants. This has important implications for 
measurement practice because the choice between a symmetric or asymmetric linking solution 
should be based on the measurement situation and not on the characteristic curve that one wishes 
to use. For example, studies of IRT parameter invariance generally estimate parameters 
separately in two groups, link the estimates to achieve a common metric, and then compare the 
linked parameter estimates. The selection of a target calibration group is arbitrary in such cases, 
and therefore, a characteristic curve approach to linking that is symmetric would be preferred. 
Alternatively, in situations where new test items are calibrated and subsequently added to a pre- 
existing item bank in which the metric is fixed, the new estimates must first be linked to the metric 
of items in the bank. In this case, there is a pre-existing target metric, and an asymmetric solution 
is justified. Unfortunately, the previous confound between the type of characteristic curve used to 
develop linking constants and the symmetric or asymmetric nature of the solution limits the 
practitioner’s ability to choose the method that is most appropriate for a given measurement 
situation. 
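
To make the notion of "simple inverses" concrete, the sketch below assumes a linear linking transformation of the form θ* = Aθ + B, so that the exact inverse transformation has scale 1/A and location -B/A. The function names and the numerical constants are hypothetical; the check simply asks whether a solution obtained with the roles of the groups reversed is the inverse of the original solution.

```python
def invert_link(A, B):
    """Inverse of the linear map theta* = A * theta + B."""
    return 1.0 / A, -B / A

def is_symmetric_pair(forward, backward, tol=1e-6):
    """True when the backward constants are the inverse of the forward ones."""
    A_inv, B_inv = invert_link(*forward)
    return abs(A_inv - backward[0]) <= tol and abs(B_inv - backward[1]) <= tol

# Made-up forward solution (group 2 -> group 1 metric).
forward = (1.25, 0.5)
exact_reverse = invert_link(*forward)

# A symmetric method would return the exact inverse when roles are reversed;
# an asymmetric method would generally return something slightly different.
symmetric_case = is_symmetric_pair(forward, exact_reverse)
asymmetric_case = is_symmetric_pair(forward, (0.81, -0.38))
```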

In order to more easily choose among characteristic curve methods of linking IRT parameter 
estimates, the symmetry of a solution must be disentangled from the type of characteristic curve 
used to derive the solution. Accordingly, a second objective of this paper was to overcome the 
confound between these two features. We did this by developing symmetric versions of the TCC 
and OCC procedures and an asymmetric version of the ICC procedure. These developments are 
described below along with a description of the original TCC, ICC and OCC methods. 

The Test Characteristic Curve (TCC) Method 

Stocking and Lord (1983) proposed a TCC approach in which the scale and location of 
parameters from one calibration are linearly transformed to match the metric of parameter 
estimates from a second calibration based on a subset of common items (i.e., anchor items). The 
transformation minimized the sum of squared differences between the test characteristic curves 
associated with common items from the two sets of parameter estimates. Originally proposed for 
a 3-parameter logistic model for binary responses, the TCC method can be generalized easily to 
the GPCM. Let $\theta_{(2)j}$ denote the location of the jth individual from the second calibration group 
on the latent continuum, and let $\theta_{(21)j}$ represent the location of the jth individual from the second 
calibration group after it has been transformed to the metric of the first calibration group. 
Similarly, let $b_{(2)i}$ refer to the location of the ith item from the second calibration, and let $b_{(21)i}$ 
represent the location of the ith item from the second calibration after it has been transformed to 
the metric of the first calibration group. The item discrimination parameter and threshold 
parameters are denoted analogously. The TCC 
procedure estimates the linking constants, A and B, that transform the metric of parameter 
estimates from the second calibration group to that of the first as follows: 

$$\theta_{(21)j} = A\,\theta_{(2)j} + B \qquad (2)$$

$$a_{(21)i} = a_{(2)i} / A \qquad (3)$$

$$b_{(21)i} = A\,b_{(2)i} + B \qquad (4)$$

$$d_{(21)ik} = A\,d_{(2)ik} \qquad (5)$$
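
The linear transformation of person and item parameters leaves the GPCM response probabilities unchanged, which is why the metric is arbitrary in the first place. The following sketch checks this numerically. It assumes the Muraki form of the GPCM without a scaling constant; the helper names and all parameter values are hypothetical.

```python
import math

def gpcm_probs(theta, a, b, d):
    """GPCM category probabilities (assumed Muraki form, no scaling constant)."""
    z, total = [], 0.0
    for d_v in d:
        total += a * (theta - b + d_v)
        z.append(total)
    z_max = max(z)
    num = [math.exp(v - z_max) for v in z]
    s = sum(num)
    return [n / s for n in num]

def to_group1_metric(theta, a, b, d, A, B):
    """Rescale group-2 quantities to the group-1 metric with constants A and B."""
    return A * theta + B, a / A, A * b + B, [A * d_v for d_v in d]

A, B = 1.25, 0.5                                  # made-up linking constants
theta2, a2, b2, d2 = -0.4, 1.1, 0.2, [0.0, 0.6, -0.6]
theta21, a21, b21, d21 = to_group1_metric(theta2, a2, b2, d2, A, B)

before = gpcm_probs(theta2, a2, b2, d2)           # original metric
after = gpcm_probs(theta21, a21, b21, d21)        # rescaled metric
max_change = max(abs(x - y) for x, y in zip(before, after))
```

Because a/A multiplies A(θ - b + d), each term in the exponent is unchanged, so `max_change` differs from zero only by floating-point rounding.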
The “scale” constant A and the “location” constant B are found by minimizing the squared 
differences between the test characteristic curves associated with anchor items from the first 
and second calibration groups after transforming person parameters. Let: 

$$TCC_{(2)j^*} = \sum_{i=1}^{I} \sum_{k=1}^{K_i} k \left[ P_{(2)ik}\left(\theta_{(2)j^*}, a_{(2)i}, b_{(2)i}, \mathbf{d}_{(2)i}\right) \right] \qquad (6)$$

$$TCC_{(21)j^*} = \sum_{i=1}^{I} \sum_{k=1}^{K_i} k \left[ P_{(21)ik}\left(\theta_{(21)j^*}, a_{(1)i}, b_{(1)i}, \mathbf{d}_{(1)i}\right) \right] \qquad (7)$$

where: 

I is the number of anchor items, 

$K_i$ is the number of response categories for the ith item, 

j* refers to some preselected evaluation points on the $\theta$ continuum, 

$P_{(2)ik}$ is the operating characteristic function for common items in the second calibration 
group, and 

$P_{(21)ik}$ is the operating characteristic function for common items in the first calibration 
group using an evaluation point on the latent continuum that has been transformed from 
the metric of the second calibration group to that for the first calibration group. 

$TCC_{(2)j^*}$ is the value of the test characteristic curve at point j* for common items from the 
second calibration. $TCC_{(21)j^*}$ is the value of the test characteristic curve at point j* for common 
items from the first calibration after the metric of the latent continuum is transformed so that the 
curve matches $TCC_{(2)j^*}$ as closely as possible. The TCC method attempts to find the 
values of the scale constant (A) and location constant (B) that minimize the following squared 
loss function: 

$$Q_{(21)} = \sum_{j^*} \left[ TCC_{(2)j^*} - TCC_{(21)j^*} \right]^2 \qquad (8)$$

In other words, the method attempts to linearly transform the metric of parameters in the second 
calibration group so that the TCC for anchor items matches the corresponding TCC in the first 
calibration group as closely as possible. In this study, $Q_{(21)}$ was minimized with respect to A and 
B using a Newton-Raphson technique. This minimization technique requires the partial 
derivatives of $Q_{(21)}$. The necessary derivatives and a description of our implementation of the 
Newton-Raphson algorithm are given in a technical report available from the authors (Roberts, 
Huang & Gagne, 2003). 
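
The asymmetric TCC loss can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the GPCM is taken in Muraki's form without a scaling constant, the anchor item parameters are made up, and a crude grid search stands in for the Newton-Raphson algorithm used in the study. All function names are hypothetical.

```python
import math

def gpcm_probs(theta, a, b, d):
    """GPCM category probabilities (assumed Muraki form, no scaling constant)."""
    z, total = [], 0.0
    for d_v in d:
        total += a * (theta - b + d_v)
        z.append(total)
    z_max = max(z)
    num = [math.exp(v - z_max) for v in z]
    s = sum(num)
    return [n / s for n in num]

def tcc(theta, items):
    """Expected total score over the anchor items; category k is scored k."""
    return sum(
        k * p
        for a, b, d in items
        for k, p in enumerate(gpcm_probs(theta, a, b, d), start=1)
    )

def q21(A, B, items1, items2, points):
    """Group-2 TCC at theta versus group-1 TCC at A * theta + B, squared and summed."""
    return sum((tcc(t, items2) - tcc(A * t + B, items1)) ** 2 for t in points)

# Hypothetical anchor items (a, b, d) on the group-1 metric.
items1 = [(1.0, -0.5, [0.0, 0.7, -0.7]), (1.4, 0.3, [0.0, 0.4, -0.4])]
# Error-free group-2 versions constructed so the true constants are A=1.25, B=0.5.
A_true, B_true = 1.25, 0.5
items2 = [(a * A_true, (b - B_true) / A_true, [dv / A_true for dv in d])
          for a, b, d in items1]

points = [-4 + 8 * j / 20 for j in range(21)]   # 21 equally spaced points

# Crude grid search over (A, B); NOT the Newton-Raphson routine from the paper.
grid = [(0.5 + 0.05 * m, -1.0 + 0.05 * n) for m in range(31) for n in range(41)]
A_hat, B_hat = min(grid, key=lambda ab: q21(ab[0], ab[1], items1, items2, points))
```

With error-free parameters the loss is essentially zero at the true constants, so the grid search recovers them; with estimated parameters the minimum would be positive and a gradient-based method would be preferable to a grid.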

The reader should note that the TCC method described above does not yield a symmetric 
solution for A and B. If one switches the roles of the first and second calibration groups in the 
above equations and recalculates the estimates of A and B, then the new solution will not 
generally be the inverse of the original solution. Thus, the TCC method provides an asymmetric 
solution. It is also important to note that Stocking and Lord (1983) did not explicitly define what 
evaluation points (i.e., j*) should be used to minimize $Q_{(21)}$. In practice, a small number of 
equally spaced points (e.g., 21) ranging from -4 to +4 is often used (Baker, 1995). 

A symmetric version of the TCC method can be developed for the GPCM in a straightforward 
manner. Relying on Haebara’s approach to forming a symmetric quadratic loss function in the 
item characteristic curve method (see below), we can define the following additional relationship: 

$$\theta_{(12)j} = \frac{\theta_{(1)j} - B}{A} \qquad (9)$$

In Equation 9, the term $\theta_{(1)j}$ refers to the location of the jth individual from the first calibration 
group on the latent continuum. In contrast, $\theta_{(12)j}$ denotes the location of the jth individual from 
the first calibration group on the latent continuum after transforming it to match the metric of the 
second calibration group. We can further define the following two TCCs: 

$$TCC_{(1)j^*} = \sum_{i=1}^{I} \sum_{k=1}^{K_i} k \left[ P_{(1)ik}\left(\theta_{(1)j^*}, a_{(1)i}, b_{(1)i}, \mathbf{d}_{(1)i}\right) \right] \qquad (10)$$

$$TCC_{(12)j^*} = \sum_{i=1}^{I} \sum_{k=1}^{K_i} k \left[ P_{(12)ik}\left(\theta_{(12)j^*}, a_{(2)i}, b_{(2)i}, \mathbf{d}_{(2)i}\right) \right] \qquad (11)$$

$TCC_{(1)j^*}$ is the test characteristic curve at point j* for common items in the first calibration 
group, whereas $TCC_{(12)j^*}$ is the test characteristic curve at point j* for common items in the 
second calibration group after the metric of the latent continuum is rescaled so that the curve 
matches $TCC_{(1)j^*}$ as closely as possible. The degree of matching can be quantified with the 
following squared loss function: 

$$Q_{(12)} = \sum_{j^*} \left[ TCC_{(1)j^*} - TCC_{(12)j^*} \right]^2 \qquad (12)$$

With these definitions in place, we can develop the following symmetric loss function: 

$$Q_{TCC} = Q_{(21)} + Q_{(12)} \qquad (13)$$

The values of the scale constant, A, and the location constant, B, which minimize this function 
are found using a Newton-Raphson technique. The necessary derivatives are given in Roberts et 
al. (2003). Note that because this function minimizes differences between test characteristic 
curves twice - first by rescaling the metric of the curve in the second calibration group and then by 
rescaling the metric of the curve in the first group - it consequently yields a symmetric solution for 
the linking constants. 

The Item Characteristic Curve Method 

Haebara (1980) proposed an alternative characteristic curve approach in which the linking 
constants, A and B, are found that minimize the sum of squared differences between item 
characteristic curves (ICC) associated with common items across two calibration groups. 
Haebara’s approach was also distinct from Stocking and Lord’s (1983) method in that it produced 
a symmetric solution. The ICC method was originally proposed for a 3-parameter logistic model 
for binary responses, but it can be extended to the GPCM in a straightforward manner. In order 
to illustrate the squared loss function for the generalized ICC method, let us define the following 
functions: 

$$ICC_{(1)ij^*} = \sum_{k=1}^{K_i} k \left[ P_{(1)ik}\left(\theta_{(1)j^*}, a_{(1)i}, b_{(1)i}, \mathbf{d}_{(1)i}\right) \right] \qquad (14)$$

$$ICC_{(12)ij^*} = \sum_{k=1}^{K_i} k \left[ P_{(12)ik}\left(\theta_{(12)j^*}, a_{(2)i}, b_{(2)i}, \mathbf{d}_{(2)i}\right) \right] \qquad (15)$$

$$ICC_{(2)ij'} = \sum_{k=1}^{K_i} k \left[ P_{(2)ik}\left(\theta_{(2)j'}, a_{(2)i}, b_{(2)i}, \mathbf{d}_{(2)i}\right) \right] \qquad (16)$$

$$ICC_{(21)ij'} = \sum_{k=1}^{K_i} k \left[ P_{(21)ik}\left(\theta_{(21)j'}, a_{(1)i}, b_{(1)i}, \mathbf{d}_{(1)i}\right) \right] \qquad (17)$$

$ICC_{(1)ij^*}$ is the value of the item characteristic curve at point j* for the ith common item in 
the first calibration group. $ICC_{(2)ij'}$ is the value of the item characteristic curve at point j' for the 
ith common item in the second calibration group. $ICC_{(12)ij^*}$ is the value of the item characteristic 
curve at point j* for the ith common item in the second calibration group after the metric of the 
latent continuum is transformed to match the metric of the first calibration group. Conversely, 
$ICC_{(21)ij'}$ is the item characteristic curve for the ith common item at point j' in the first 
calibration group after the metric of the latent continuum is transformed to match the metric of 
the second calibration group. The squared loss function in Haebara’s ICC method can then be 
written as: 

$$Q_{ICC} = \sum_{i=1}^{I} Q_{(12)i} + \sum_{i=1}^{I} Q_{(21)i} = \sum_{i=1}^{I} \left[ \sum_{j^*} \left( ICC_{(1)ij^*} - ICC_{(12)ij^*} \right)^2 + \sum_{j'} \left( ICC_{(2)ij'} - ICC_{(21)ij'} \right)^2 \right] \qquad (18)$$



The values of the scale constant, A, and location constant, B, that minimize $Q_{ICC}$ are found using 
a Newton-Raphson technique. The derivatives required to perform this minimization are 
given in Roberts et al. (2003). If the roles of the first and second calibration groups are reversed 
in Equation 18, then the resulting solution is the inverse of the original solution, and thus, 
Haebara’s solution is symmetric. Haebara originally suggested the evaluation points, j* and j', be 
based on the distributions of estimated $\theta$ values in the first and second calibration groups, 
respectively. However, any reasonable set of evaluation points could be used, and in the 
remaining part of this paper, we simply define j* = j'. 

An asymmetric solution can easily be developed for the ICC method by simply minimizing 
either the $\sum_{i} Q_{(12)i}$ or the $\sum_{i} Q_{(21)i}$ term from Equation 18 with respect to A and B. To 
maintain comparability with the definition of the asymmetric solution in the TCC method, we 
chose to minimize the latter term: 

$$\sum_{i=1}^{I} Q_{(21)i} = \sum_{i=1}^{I} \sum_{j'} \left[ ICC_{(2)ij'} - ICC_{(21)ij'} \right]^2 \qquad (19)$$
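
The structure of the symmetric ICC loss, and the reason it yields inverse solutions when the groups are reversed, can be illustrated numerically. The sketch below assumes the Muraki form of the GPCM without a scaling constant and uses made-up item parameters; the function names are hypothetical. It evaluates the two-term loss once in the original direction and once with the groups swapped and the constants inverted.

```python
import math

def gpcm_probs(theta, a, b, d):
    """GPCM category probabilities (assumed Muraki form, no scaling constant)."""
    z, total = [], 0.0
    for d_v in d:
        total += a * (theta - b + d_v)
        z.append(total)
    z_max = max(z)
    num = [math.exp(v - z_max) for v in z]
    s = sum(num)
    return [n / s for n in num]

def icc(theta, item):
    """Expected item score; category k is scored k."""
    a, b, d = item
    return sum(k * p for k, p in enumerate(gpcm_probs(theta, a, b, d), start=1))

def q21_icc(A, B, items1, items2, points2):
    """Asymmetric term: group-2 curves at j' vs group-1 curves at A*theta + B."""
    return sum((icc(t, i2) - icc(A * t + B, i1)) ** 2
               for i1, i2 in zip(items1, items2) for t in points2)

def q12_icc(A, B, items1, items2, points1):
    """Reverse term: group-1 curves at j* vs group-2 curves at (theta - B)/A."""
    return sum((icc(t, i1) - icc((t - B) / A, i2)) ** 2
               for i1, i2 in zip(items1, items2) for t in points1)

def q_icc(A, B, items1, items2, points1, points2):
    """Symmetric loss: sum of both directional terms."""
    return (q12_icc(A, B, items1, items2, points1)
            + q21_icc(A, B, items1, items2, points2))

# Hypothetical, imperfectly matching estimates for two anchor items.
items1 = [(1.0, -0.5, [0.0, 0.7, -0.7]), (1.4, 0.3, [0.0, 0.4, -0.4])]
items2 = [(1.2, -0.7, [0.0, 0.5, -0.5]), (1.7, -0.1, [0.0, 0.35, -0.35])]
points = [-4 + 8 * j / 20 for j in range(21)]

A, B = 1.2, 0.4
forward = q_icc(A, B, items1, items2, points, points)
# Swap the roles of the groups and plug in the inverse constants.
reverse = q_icc(1 / A, -B / A, items2, items1, points, points)
```

Because `forward` and `reverse` agree for every pair (A, B), the reversed problem has the same loss surface under inversion, so its minimizer is the inverse of the original one. The asymmetric variant keeps only `q21_icc`, for which this identity does not hold in general.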



Operating Characteristic Curve Method 

Baker (1993) developed an operating characteristic curve (OCC) method to link IRT 
parameter estimates. He originally proposed the method for the nominal model where calculation 
of an ICC or TCC may not be applicable. However, Baker noted that the technique is equally 
appropriate in the case of graded polytomous responses such as those modeled in the GPCM. 

Let: 

$$OCC_{(2)ikj^*} = P_{(2)ik}\left(\theta_{(2)j^*}, a_{(2)i}, b_{(2)i}, \mathbf{d}_{(2)i}\right) \qquad (20)$$

$$OCC_{(21)ikj^*} = P_{(21)ik}\left(\theta_{(21)j^*}, a_{(1)i}, b_{(1)i}, \mathbf{d}_{(1)i}\right) \qquad (21)$$

The OCC method finds the scale and location constants, A and B, that minimize the following 
squared loss function: 

$$Q_{OCC} = \sum_{j^*} \sum_{i=1}^{I} \sum_{k=1}^{K_i} \left[ OCC_{(2)ikj^*} - OCC_{(21)ikj^*} \right]^2 \qquad (22)$$



This minimization problem can be solved with the Newton-Raphson algorithm 
using the derivatives given in Roberts et al. (2003). Baker (1995) has suggested that a limited 
number of evaluation points (25 or fewer) that are equally spaced between $\theta$ = ±4 can be used to 
evaluate differences between the OCCs in the two calibration groups. 
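
The OCC loss differs from the TCC and ICC losses only in that category probabilities are compared directly rather than being combined into expected scores. A minimal sketch follows, assuming the Muraki form of the GPCM without a scaling constant, made-up anchor items, and 25 equally spaced evaluation points; the function names are hypothetical.

```python
import math

def gpcm_probs(theta, a, b, d):
    """GPCM category probabilities (assumed Muraki form, no scaling constant)."""
    z, total = [], 0.0
    for d_v in d:
        total += a * (theta - b + d_v)
        z.append(total)
    z_max = max(z)
    num = [math.exp(v - z_max) for v in z]
    s = sum(num)
    return [n / s for n in num]

def q_occ(A, B, items1, items2, points):
    """Asymmetric OCC loss: squared category-probability differences."""
    total = 0.0
    for t in points:
        for (a1, b1, d1), (a2, b2, d2) in zip(items1, items2):
            p2 = gpcm_probs(t, a2, b2, d2)            # group 2 at theta
            p1 = gpcm_probs(A * t + B, a1, b1, d1)    # group 1 at A*theta + B
            total += sum((x - y) ** 2 for x, y in zip(p2, p1))
    return total

# Hypothetical anchor items on the group-1 metric and error-free group-2
# versions built so that the true constants are A = 1.25 and B = 0.5.
items1 = [(1.0, -0.5, [0.0, 0.7, -0.7]), (1.4, 0.3, [0.0, 0.4, -0.4])]
items2 = [(a * 1.25, (b - 0.5) / 1.25, [dv / 1.25 for dv in d])
          for a, b, d in items1]
points = [-4 + 8 * j / 24 for j in range(25)]   # 25 points from -4 to +4

loss_at_truth = q_occ(1.25, 0.5, items1, items2, points)
loss_unlinked = q_occ(1.0, 0.0, items1, items2, points)
```

With error-free parameters the loss vanishes at the true constants and is clearly positive when no transformation is applied (A = 1, B = 0).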

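The OCC method as originally proposed by Baker (1993) is obviously asymmetric. However, 
a symmetric version can be easily developed if we let: 

$$OCC_{(1)ikj^*} = P_{(1)ik}\left(\theta_{(1)j^*}, a_{(1)i}, b_{(1)i}, \mathbf{d}_{(1)i}\right) \qquad (23)$$

$$OCC_{(12)ikj^*} = P_{(12)ik}\left(\theta_{(12)j^*}, a_{(2)i}, b_{(2)i}, \mathbf{d}_{(2)i}\right) \qquad (24)$$

$$Q_{OCC(12)} = \sum_{j^*} \sum_{i=1}^{I} \sum_{k=1}^{K_i} \left[ OCC_{(1)ikj^*} - OCC_{(12)ikj^*} \right]^2 \qquad (25)$$

With these definitions in place, a squared loss function can be developed to yield a symmetric 
solution: 

$$Q_{OCC}^{*} = \sum_{j^*} \sum_{i=1}^{I} \sum_{k=1}^{K_i} \left[ OCC_{(2)ikj^*} - OCC_{(21)ikj^*} \right]^2 + \sum_{j^*} \sum_{i=1}^{I} \sum_{k=1}^{K_i} \left[ OCC_{(1)ikj^*} - OCC_{(12)ikj^*} \right]^2 \qquad (26)$$

Again, this function is minimized with respect to the linking constants, A and B, using standard 
techniques like the Newton-Raphson method. The derivatives of the loss function required to 
implement the method are given in Roberts et al. (2003). 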



Simulation Study 

A simulation study was performed to examine and compare the behavior of each of the six 
aforementioned linking methods. In this study, item responses from two independent groups of 
simulees were generated for the same 20-item test form. These responses were generated to 
conform with the GPCM and were subsequently used to estimate GPCM parameters separately in 
each of the two groups. The parameter estimates from the two groups were linked using each of 
the six linking methods. Linking was based on either the entire set of common test items or on 
randomly chosen subsets. The nature of the evaluation points chosen to compare characteristic 
curves was also varied. Each aspect of the simulation is described in detail below. 

Method 

Item Characteristics 

Item parameter estimates from the GPCM published in the 1998 NAEP technical report 
(Allen, Donoghue & Schoeps, 2001) served as the true item parameters in the simulation. Only 
the 162 NAEP items with three response categories were used in an effort to simplify the 
experimental design. Of these initial items, only the 153 items with $|b_i| < 2$ were maintained in the 
item pool. Elimination of items with more extreme locations was done in an effort to assure that 
each response category would be used at least once for every item. On a given replication, 20 
items were randomly sampled from the item pool. Responses to these items were simulated 
independently in the two groups, and then the sampled items were returned to the pool for 
subsequent resampling. 

Sample Size and Simulee Characteristics 

In a given experimental condition, either 300 or 2000 respondents were simulated in each of 
the respondent groups. The true $\theta$ values for these simulees were derived in one of two different 
ways. In one condition, true $\theta$ values were generated from a N(0, 1) distribution in both 
respondent groups. In the other condition, the true $\theta$ values were generated from a N(0, 1) 
distribution for the first group, and a N(.5, 1.25^2) distribution for the second group. Given the 
lack of standard terminology, we will refer to the first condition as a horizontal linking scenario, 
whereas the second condition will be referred to as the vertical linking scenario. True $\theta$ values 
were independently generated on every replication within each condition. 
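
The data generation step for the vertical linking scenario can be sketched as follows. This is a hedged illustration rather than the simulation code used in the study: it assumes the Muraki form of the GPCM without a scaling constant, reads the second argument of each normal distribution as a variance (so the second group has a standard deviation of 1.25), and uses three made-up items in place of the 20 sampled NAEP items. The function names and the seed are hypothetical.

```python
import math
import random

def gpcm_probs(theta, a, b, d):
    """GPCM category probabilities (assumed Muraki form, no scaling constant)."""
    z, total = [], 0.0
    for d_v in d:
        total += a * (theta - b + d_v)
        z.append(total)
    z_max = max(z)
    num = [math.exp(v - z_max) for v in z]
    s = sum(num)
    return [n / s for n in num]

def simulate_group(items, n, mean, sd, rng):
    """Draw n true thetas from N(mean, sd^2) and one response per item."""
    responses = []
    for _ in range(n):
        theta = rng.gauss(mean, sd)
        row = [
            rng.choices(range(1, len(d) + 1),
                        weights=gpcm_probs(theta, a, b, d))[0]
            for a, b, d in items
        ]
        responses.append(row)
    return responses

rng = random.Random(2003)   # arbitrary seed for reproducibility
items = [(1.0, -0.5, [0.0, 0.7, -0.7]),
         (1.4, 0.3, [0.0, 0.4, -0.4]),
         (0.8, 1.0, [0.0, 0.2, -0.2])]

group1 = simulate_group(items, 300, 0.0, 1.0, rng)    # N(0, 1)
group2 = simulate_group(items, 300, 0.5, 1.25, rng)   # N(.5, 1.25^2)
```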

Calculation of Model Parameter Estimates 

Parameter estimates were derived from simulated item responses using the PARSCALE 
computer program (Version 3.2; Muraki & Bock, 1997). A marginal maximum a posteriori 
estimation algorithm was used to derive item parameter estimates. The algorithm utilized a 
N(0, 1) prior distribution for $\theta$, a log-normal(0, .5^2) prior distribution for slopes, and a N(0, 2^2) 
prior for item locations. Thirty equally spaced quadrature points between -4 and +4 were used to 
obtain a solution along with a convergence criterion in which the largest absolute change for any 
item parameter estimate was less than .001 from one iteration to the next. 

After obtaining solutions for item parameters, expected a posteriori (EAP) estimates of $\theta$ 
were obtained for each simulee. These estimates were calculated using 30 equally spaced 
quadrature points between -4 and +4 along with a N(0, 1) prior distribution. 

Linking Parameter Estimates 

The parameter estimates for the two calibration groups were linked using both the symmetric 
and asymmetric forms of the TCC, ICC and OCC methods. Furthermore, each of these methods 
was implemented using either 5, 10, 15 or 20 common items. The common items used in the 
linking procedure were randomly selected at the beginning of each replication. However, items 
were added to the set of common items in blocks of 5 items so that, within a given replication, 
the larger sets of common items contained the smaller sets. (In other words, 5 items were added 
to the original set of 5 common items to produce a set of 10 common items, and so on.) 
Differences between characteristic curves were evaluated at points on the latent continuum 
defined by one of three strategies. Curves were contrasted at 50 equally spaced points between 
$\theta$ = (-4, +4), at 2N equally spaced points between $\theta$ = (-4, +4), or at the 2N values of $\hat{\theta}$ that 
were estimated in the two corresponding calibration groups. 

Experimental Design 

The simulation conditions were structured as a 2 (sample size) x 2 (linking scenario) x 3 
(characteristic curve type) x 2 (solution symmetry) x 3 (evaluation points) x 4 (anchor items) 
split-plot factorial design. The first two of these factors comprised the between-replication 
conditions, whereas the last four factors were within-replication conditions. There were 100 
replications in each between-replication condition. On a given replication, the true item 
parameters were randomly sampled from the item pool, item responses were generated in each of 
the two independent groups, GPCM estimates were calculated separately in the two groups, and 
the parameter estimates were repeatedly linked using the 72 different procedures defined by the 
within-replication factors. 
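
The factor structure can be enumerated directly to confirm the number of linking procedures applied within each replication. The labels below are shorthand chosen for this sketch; only the factor levels themselves come from the design described above.

```python
from itertools import product

# Between-replication factors.
between = {
    "sample_size": [300, 2000],
    "linking_scenario": ["horizontal", "vertical"],
}
# Within-replication factors.
within = {
    "curve_type": ["TCC", "ICC", "OCC"],
    "symmetry": ["symmetric", "asymmetric"],
    "evaluation_points": ["50 equally spaced", "2N equally spaced",
                          "2N estimated thetas"],
    "anchor_items": [5, 10, 15, 20],
}

between_cells = list(product(*between.values()))
within_cells = list(product(*within.values()))

replications_per_cell = 100
total_linkings = len(between_cells) * replications_per_cell * len(within_cells)
```

Here `within_cells` has 3 x 2 x 3 x 4 = 72 entries, matching the 72 procedures per replication, and the four between-replication cells with 100 replications each imply 28,800 linking solutions in total.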

Data Analyses 

Primary and secondary data analyses were performed to assess the accuracy of linking under 
the conditions that were explored. In the primary analyses, the squared error of the estimated 
scale constant, $(\hat{A} - A)^2$, and that for the estimated location constant, $(\hat{B} - B)^2$, were analyzed 
using a univariate split-plot ANOVA. Consequently, the hypotheses tested by these ANOVA 
models were framed in terms of mean squared error (MSE). The nominal Type I error rate was 
set to .025 when testing each ANOVA effect due to the fact that two dependent measures were 
studied. The probability of each within-replication effect under the null hypothesis was adjusted 
using the Huynh-Feldt (1970) procedure. Each effect in the ANOVA model was classified into 
one of 16 effect families. An effect family included all the effects tested by a given error term 
along with the error term itself. The proportion of familywise variance associated with each effect 
was calculated. This is referred to as the $\eta^2_{FW}$. In order to limit the interpretation of trivial 
effects, only those effects that were both statistically significant and associated with $\eta^2_{FW}$ > .03 
were interpreted in this paper. 

The secondary data analyses explored the squared deviation of GPCM parameter estimates 
from their true values using ANOVA techniques. Thus, the hypotheses for each ANOVA effect 
were framed in terms of mean squared deviations (MSD). There were two basic types of ANOVA 
models constructed to examine MSD. First, the squared error of GPCM parameter estimates 
(i.e., $\hat{b}$, $\hat{a}$, $\hat{d}$, and $\hat{\theta}$) for the first calibration group was modeled as a function of sample size, 
linking scenario and their interaction. A second ANOVA model was used to investigate the 
squared error of parameter estimates from the second calibration group after they were rescaled 
to the metric of the first calibration group. These squared errors were examined using an 
ANOVA model which included all the between-replication and within-replication effects in the 
experimental design. The Type I error rate for the secondary data analyses was set to .0125 
because there were four dependent measures examined in each type of ANOVA model. As in the 
primary analyses, probability values for within-replication effects were adjusted using the 
Huynh-Feldt procedure. Given that the total number of effects tested in these secondary analyses was 
substantially larger than that for the primary analyses, the effect size criterion for interpretation 
was increased to $\eta^2_{FW}$ > .05. Thus, only those effects that were statistically significant and had 
corresponding $\eta^2_{FW}$ > .05 were ultimately interpreted. 

Results 

Primary Analyses 
Table 1 lists the ANOVA effects for the MSE of A that were deemed to be interpretable on 
the basis of the previously defined operational criteria. Of these interpretable effects, some were 
main effects that were subsumed under higher order interactions. Only the higher order effects 
will be described in this paper. The MSE for A was partially a function of a two-way interaction 
between the number of anchor items used to estimate linking constants and the sample size used 
to calibrate GPCM parameters. The form of this interaction is shown in the top panel of Figure 1. 
When parameter estimates were calibrated using large samples (N = 2000), the MSE for A was 
minimal, and the number of anchor items had little effect on the MSE values. However, when the 
calibration sample size was small (N = 300), the MSE for A increased noticeably, and this 
increase was inversely related to the number of anchor items used in the linking solution. The 
mean differences associated with this interaction were quite small and were confined to the third 
decimal place. However, these differences were not considered to be ignorable. For example, the 
difference between the root of the MSE incurred in the two sample size conditions with 5 anchor 
items was equal to .058, and this represented 5.8% of the true $\theta$ standard deviation. For the 20 
anchor item condition, the difference in root MSE decreased to .032. 

Insert Table 1 About Here 

Insert Figure 1 About Here 

The interaction between linking scenario and calibration sample size was also classified as 
interpretable. This interaction is illustrated in the bottom panel of Figure 1. With a large 
calibration sample size, the MSE of A was quite small, and the linking scenario (i.e., horizontal or 
vertical linking) had virtually no effect on its magnitude. In contrast, when the calibration sample 
size was small, the MSE of A increased for both linking scenarios, but this increase was larger in 
the vertical linking condition. Again, this effect was small, but not ignorable. The difference in 
root MSE values for A in the vertical linking condition was approximately .05, which was equal to 
5% of the standard deviation of true $\theta$ values, whereas the difference in root MSE for the 
horizontal linking condition was equal to .032. 

There was also a three-way interaction between type of characteristic curve utilized, the 
symmetry of the solution, and calibration sample size. This interaction is displayed in Figure 2. 
The MSE for A was trivial and showed little sensitivity to the type of characteristic curve or the 
symmetry of the solution when a large calibration sample size was used. However, when a small 
calibration sample size was used, then the MSE for A increased, and a small interaction between 
the type of characteristic curve used and the symmetry of the solution emerged. The TCC 
approach produced the smallest amount of error, followed by the ICC and OCC methods, 
respectively. With the latter two methods, there appeared to be a small effect of symmetry such 
that a symmetric solution produced a slightly smaller MSE for A as compared to an asymmetric 
solution. The tiny magnitude of this interaction was corroborated by its sum of squares which 
was nearly zero. (As shown in Table 1, the sum of squares for this interaction rounded to zero at 
the third decimal place.) Given its very small magnitude, this interaction is probably of little 
practical importance. 

Insert Figure 2 About Here 

Table 2 lists the ANOVA effects corresponding to the MSE for B that were deemed to be 
interpretable. As was the case for A, the MSE for B was a function of the two-way interaction 
between the number of anchor items used in the linking solution and the sample size used to 
calibrate GPCM parameter estimates. The means corresponding to this interaction are shown in 
the top panel of Figure 3. The pattern of means is quite similar to that found with the MSE for 
A, and suggested that the MSE for B was trivial in magnitude and insensitive to the number of 
anchor items used in the linking solution when the calibration sample size was large. With a small 
calibration sample size, the MSE for B increased, but the accuracy of B improved as the number 
of anchor items grew larger. The difference in the root MSE for B between the sample size 
conditions was equal to .037 when the number of anchor items was equal to 5. This difference 
decreased to .022 when the number of anchor items was 20. 

Insert Table 2 About Here 

Insert Figure 3 About Here 

The bottom panel of Figure 3 illustrates another two-way interaction in which the MSE for B 
was a function of the type of characteristic curve used in the solution and the nature of the 
evaluation points used to contrast characteristic curves. When either 50 or 2N equally spaced 
evaluation points were used, then the TCC method produced slightly more accurate estimates of 
B relative to the ICC and OCC methods. The latter two methods produced similar MSE values. 
In contrast, when the characteristic curves were evaluated at the 2N points associated with the θ 
estimates, all the characteristic curve methods yielded similar MSE values, and these values were 
slightly smaller than those in the other evaluation point conditions. Although this effect was 
interesting, it was so small that it would have few, if any, pragmatic consequences. 

Secondary Analyses 

Group 1 Analyses. The two-way ANOVA on the squared difference of GPCM parameters for 
Group 1 revealed intuitive results. The main effect of calibration sample size on MSD led to η²_WF 
values equal to .63, .65 and .52 for estimates of b, a, and d, respectively. These sample size 
effects were all statistically significant at the p<.0001 level and all were in the predicted direction; 
larger sample sizes led to smaller MSDs (.003 versus .017 for b, .004 versus .021 for a, and .007 
versus .043 for d). These differences were very small in the metric of MSD. However, when 
differences were expressed in terms of root MSD, they appeared more substantial (.052 versus 
.131 for b, .059 versus .144 for a, and .083 versus .208 for d). Moreover, when these 
differences were interpreted in the light of the root pooled variance of true item parameters within 
replications (.659 for b, .293 for a, and .937 for d), they seemed more substantial still. Specifically, 
the differences in root MSD represented 11.99%, 29.01%, and 13.34% of the root pooled within- 
replication variance of true b, a, and d parameters, respectively. 
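
The percentages in the preceding paragraph can be reproduced directly from the reported root MSD values. The short sketch below is only a check on that arithmetic; the dictionary and function names are illustrative and do not come from the study.

```python
# Root MSD values reported above, as (large sample, small sample) pairs.
root_msd = {"b": (0.052, 0.131), "a": (0.059, 0.144), "d": (0.083, 0.208)}

# Root pooled within-replication variance of the true item parameters.
root_pooled_var = {"b": 0.659, "a": 0.293, "d": 0.937}


def percent_of_spread(param):
    """Root MSD difference between sample sizes as a percent of parameter spread."""
    large, small = root_msd[param]
    return 100.0 * (small - large) / root_pooled_var[param]


percentages = {p: round(percent_of_spread(p), 2) for p in root_msd}
print(percentages)  # {'b': 11.99, 'a': 29.01, 'd': 13.34}
```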

The MSD for θ in Group 1 was not statistically related to calibration sample size, linking 
scenario, or their interaction. This was expected because the number of responses to informative 
items determines the accuracy of an EAP estimate, and the θ estimates were calculated from 20 item 
responses in all conditions. Additionally, the θ estimates for Group 1 were not rescaled via the 
linking constants, and therefore, one would not necessarily expect their accuracy to be dependent 
on the linking scenario if the test provided ample information across the latent continuum. 
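
To illustrate why the number of item responses drives the accuracy of an EAP estimate, the following is a minimal sketch of EAP scoring under the GPCM. It rests on several assumptions that are not stated in this section: a Muraki-style parameterization with slope a, location b, and category parameters d (scaling constant omitted), a standard normal prior, and 61 equally spaced quadrature nodes on [-4, 4]. The item values and response patterns are hypothetical.

```python
import math


def gpcm_probs(theta, a, b, d):
    """Category probabilities for one GPCM item (assumed Muraki-style form)."""
    z = [0.0]  # category 0 has an empty cumulative sum
    for dk in d:
        z.append(z[-1] + a * (theta - b + dk))
    top = max(z)  # subtract the maximum for numerical stability
    ez = [math.exp(v - top) for v in z]
    total = sum(ez)
    return [e / total for e in ez]


def eap(responses, items, n_quad=61, lo=-4.0, hi=4.0):
    """Return (EAP estimate, posterior SD) under an assumed N(0, 1) prior."""
    step = (hi - lo) / (n_quad - 1)
    nodes = [lo + i * step for i in range(n_quad)]
    post = []
    for t in nodes:
        w = math.exp(-0.5 * t * t)  # unnormalised prior density
        for x, (a, b, d) in zip(responses, items):
            w *= gpcm_probs(t, a, b, d)[x]  # likelihood of observed category
        post.append(w)
    norm = sum(post)
    mean = sum(t * w for t, w in zip(nodes, post)) / norm
    var = sum((t - mean) ** 2 * w for t, w in zip(nodes, post)) / norm
    return mean, math.sqrt(var)


# Hypothetical three-category item, repeated to form 5- and 20-item tests.
item = (1.0, 0.0, [0.5, -0.5])
pattern = [0, 1, 2, 1, 1]
mean5, sd5 = eap(pattern, [item] * 5)
mean20, sd20 = eap(pattern * 4, [item] * 20)
print(sd5, sd20)  # the posterior SD shrinks as more items are answered
```

Because every condition in the study scored respondents on 20 items, the posterior spread of the θ estimates would be expected to be about the same across conditions, which is consistent with the null result reported above.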

Group 2 Analyses. The ANOVA on the squared difference for rescaled GPCM item 
parameter estimates from Group 2 revealed several interesting findings. Table 3 lists the effects 
that were deemed to be interpretable. With regard to the MSD for b, there was a two-way 
interaction between calibration sample size and the number of anchor items used to solve for 
linking coefficients. This interaction is depicted in the upper panel of Figure 4. As shown in the 
figure, when the calibration sample size was large, the MSD for b was minimal and insensitive to 
the number of anchor items used to solve for linking constants. In contrast, when the calibration 
sample size was small, the MSD for b increased, and slight reductions in MSD emerged as the 
number of anchor items increased. The difference in root MSD for b between the two sample 
size conditions was equal to .084 when linking was based on 5 anchor items, and it decreased to 
.072 when 20 anchor items were used. These differences represented 12.7% and 11.0% of the 
root pooled variance of true b values within replications. Thus, the increase in MSD due to small 
calibration sample size was mitigated very slightly by an increase in anchor items. It is also 
important to note that the MSD for b in the small calibration condition decreased very little after 
the number of anchor items reached 15. 

Insert Table 3 About Here 

Insert Figure 4 About Here 

The ANOVA for the MSD of b also revealed an interpretable interaction between the nature 
of evaluation points and the number of anchor items used in the linking solution. This effect is 
illustrated in the bottom panel of Figure 4. The interaction suggested that using estimated θ 
values from both calibration groups as points to evaluate characteristic curve differences was 
slightly preferable when a small number of anchor items was employed. However, with 15 or more 
anchor items, using θ estimates as evaluation points produced slightly higher MSD values for 
b. Although this interaction met the criteria for interpreting an effect, the corresponding mean 
differences were negligible, as suggested by the nearly zero sum of squares for this effect shown 
in Table 3. 

Insert Figure 5 About Here 

A three-way interaction of the characteristic curve, the symmetry, and the evaluation points 
inherent in a given solution also emerged when evaluating the MSD for b. The interaction is 
portrayed in Figure 5. When equally spaced evaluation points were used, the symmetric or 
asymmetric nature of the solution had little effect on the MSD for b unless an ICC method was 
used. In this case, an asymmetric solution seemed slightly preferable. However, a 
symmetric solution led to a slightly smaller MSD for the ICC and OCC methods when θ estimates 
served as evaluation points. The symmetry of a solution had little impact on the MSD for b when 
the TCC method was used, regardless of the nature of evaluation points. Given the very small 
size of these mean differences and the small sum of squares associated with the interaction, it is 
doubtful that this finding will have any pragmatic consequences for practitioners. 

With regard to the MSD for a, there were two ANOVA effects that met the criteria for 
interpretation. The first of these was a two-way interaction between calibration sample size and 
number of anchor items, which is shown in Figure 6. As was the case with b, the MSD for a was 
negligible when the calibration sample size was large, regardless of the number of anchor items 
used in the linking solution. However, for small calibration samples, the MSD for a was 
noticeably larger, and it decreased very slightly as the number of anchor items increased. These 
decreases in MSD attenuated when the number of anchor items was 15 or more. The difference 
in root MSD for a between the two sample size conditions was equal to .096 when linking was 
based on 5 anchor items. The difference decreased to .081 when 20 items were used in the 
linking procedure. These values represented 32.9% and 27.6% of the root pooled variation in 
true a values within replications. Thus, the noticeable ill effects of a small calibration sample size 
were mitigated slightly by an increase in the number of anchor items used in the linking solution. 

Insert Figures 6 and 7 About Here 

The MSD for a was also dependent on a three-way interaction between the calibration sample 
size, the type of characteristic curve employed, and the symmetric nature of the solution. This 
interaction is shown in Figure 7. With large calibration samples, the MSD for a was quite small, 
and the remaining two factors had virtually no impact on its value. With small calibration 
samples, the MSD for a increased noticeably. Additionally, there was a very slight advantage for 
the symmetric solution when using either an ICC or OCC method, as opposed to the TCC 
method. However, the size of this simple interaction effect was extremely small and is probably 
of no practical importance. 

The MSD for d was primarily a function of the calibration sample size and the number of 
anchor items used to solve for linking constants. Both of these main effects are displayed in 
Figure 8. As shown in the top panel of the figure, the large calibration sample led to an MSD for 
d that was .032 smaller than that in the small calibration sample condition. The difference in the 
corresponding root MSD values was equal to .115 and represented 12.2% of the root pooled 
variation in true d parameters within replications. Therefore, this effect was considered to be 
small, but still meaningful. 



Insert Figure 8 About Here 

The bottom panel of Figure 8 illustrates the main effect of the number of anchor items in the 
linking solution on the MSD for d. The MSD decreased as the number of anchor items 
increased, although the size of the decrease diminished once the number of anchor items had 
reached 15 or more. The MSD for d decreased by .002 when the number of anchor items 
increased from 5 to 10. The corresponding difference in root MSD values was also quite small 
(.006), and thus, this effect was thought to have few, if any, practical implications. 

With regard to the MSD for rescaled θ estimates from Group 2, there were several interpretable 
effects that emerged from the ANOVA, and each of these is listed in Table 4. The strongest 
effect was the main effect of linking scenario. The MSD for rescaled θ estimates was larger in the 
vertical linking scenario (.144) relative to the horizontal linking condition (.128). When 
transformed into root MSD, the magnitude of this difference was small and represented 2.2% of 
the standard deviation of true θ. 

Insert Table 4 About Here 
Insert Figure 9 About Here 

The MSD for θ was also dependent on the type of characteristic curve used to estimate 
linking constants. As shown in the top panel of Figure 9, the TCC method led to MSD values 
that were smaller than those for the ICC and OCC methods. However, given the very small size of 
these differences, their impact was deemed negligible. 

There was also a two-way interaction between calibration sample size and number of anchor 
items used in the linking solution. This interaction is shown in the bottom panel of Figure 9. 
When the sample size was large, the number of anchor items used to estimate linking constants 
had little effect on the accuracy of the rescaled θ estimates. However, the MSD for θ increased 
slightly in the small sample size conditions, and these increases were mitigated as the number of 
anchor items increased. When 5 anchor items were utilized, the difference in root MSD between 
sample size conditions was equal to .013. This difference fell to .003 when 20 anchor items were 
used in the linking process. Again, these differences were very small and were thought to have 
little practical importance. 

Discussion 

Results from both the primary and secondary analyses suggested several convergent findings. 
First, all the differences found among various linking conditions studied here were small. In 
many cases, the differences were extremely tiny and thought to have few pragmatic implications. 
This may be due to the fact that the simulated data fit the GPCM, and thus, reasonable parameter 
estimates were obtained in all between-replication conditions. If accurate parameter estimates are 
available, then linking parameters can be easily recovered and there is less opportunity for linking 
methods to differ (Cohen & Kim, 1998). Nonetheless, when any new method is introduced, the 
method must first be shown to work well in ideal situations before exploring its characteristics 
under more realistic conditions. In this regard, the accuracy of linking achieved with methods that 
have been generalized or created in this study was consistently reasonable. 

Although all the effects discovered in the simulation were generally quite small, there were, 
nonetheless, consistent findings that emerged. As the calibration sample size increased, the 
accuracy of the linking methods increased, and variables inherent in a specific linking 
situation had little, if any, effect on the resulting scale transformation. Again, this was presumably 
due to the fact that IRT item parameters were estimated very well with large samples. When one 
has accurate item parameter estimates of anchor items in two groups, then finding the linear 
transformation that relates the metrics of the two groups is relatively easy and can be 
accomplished accurately using a variety of solution schemes. 
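
As an illustration of the linear transformation discussed above, the sketch below applies a pair of linking constants A and B to GPCM parameters and confirms that the category probabilities are unchanged. It assumes a Muraki-style parameterization (slope a, location b, category parameters d, scaling constant omitted), under which θ* = Aθ + B, b* = Ab + B, a* = a/A, and d* = Ad. The function names and numerical values are hypothetical.

```python
import math


def gpcm_probs(theta, a, b, d):
    """Category probabilities for one GPCM item (assumed Muraki-style form)."""
    z = [0.0]  # category 0 has an empty cumulative sum
    for dk in d:
        z.append(z[-1] + a * (theta - b + dk))
    top = max(z)  # subtract the maximum for numerical stability
    ez = [math.exp(v - top) for v in z]
    total = sum(ez)
    return [e / total for e in ez]


def rescale_item(a, b, d, A, B):
    """Move one item's parameters onto the target metric."""
    return a / A, A * b + B, [A * dk for dk in d]


def rescale_theta(theta, A, B):
    """Move a person parameter onto the target metric."""
    return A * theta + B


# Hypothetical item and linking constants.
a, b, d = 1.1, -0.3, [0.5, -0.5]
A, B = 1.2, 0.5
a_star, b_star, d_star = rescale_item(a, b, d, A, B)

theta = 0.7
before = gpcm_probs(theta, a, b, d)
after = gpcm_probs(rescale_theta(theta, A, B), a_star, b_star, d_star)
max_diff = max(abs(p - q) for p, q in zip(before, after))
print(max_diff)  # zero up to floating-point rounding
```

The invariance shown here is what makes the metric arbitrary in the first place, and it is the property that any characteristic curve criterion exploits when solving for A and B.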

When the calibration sample size was small, the accuracy of the linking procedure degraded 
somewhat, and other factors began to affect the quality of the linking results ever so slightly. The 
most important of the factors studied here was the number of anchor items used to estimate 
linking constants. There was noticeable improvement in linking accuracy when the number of 
anchor items was increased from 5 to 10 items in the small calibration condition. These 
improvements attenuated once the number of linking items reached 15. The generality of this 
result is no doubt highly dependent on the particular tests that are calibrated in the two 
respondent groups. In our simulation study, the item pool consisted of all non-extreme 1998 
NAEP items with 3 response categories. One could conjecture that between 10 and 15 anchor items 
will be needed to achieve optimal linking results for similar item pools. Unfortunately, 
determining the similarity between a given item pool and the one used in this study is fraught with 
difficulty, and thus, further study is required before specific recommendations can be made. 

There was also an effect of the linking scenario on the accuracy of estimated linking constants 
when small calibration sample sizes were used. In this case, the scale constant estimated by the 
linking procedure was more accurate when the distributions of the two respondent groups were 
identical. This suggests that achieving a common metric for item parameters may be more 
difficult in vertical testing situations when the calibration sample size is small. Again, when the 
sample size is large, then this issue is of little concern. 

The type of characteristic curves used to develop linking estimates and the points at which the 
curves are evaluated had systematic, but trivial, effects on the accuracy of linking. The TCC 
method generally produced linking results that were slightly more accurate than the ICC method, 
which, in turn, produced slightly more accurate results when compared to the OCC method. 
However, these differences will probably have negligible impact in practice. 
Nonetheless, if one is forced to choose among these methods based only on the results of this 
simulation, then the choice seems clear. The TCC method consistently provided equal or better 
linking accuracy in every condition studied here. The choice among the types of evaluation 
points to use is a bit less clear. Again, the effects of this variable were very small and of little 
practical importance. However, evaluation of characteristic curve differences at θ estimates usually 
led to more accurate linking. The fact that this advantage was quite small may, itself, be 
practically meaningful because linking is sometimes required in cases where there is no access to 
θ estimates (e.g., linking item parameters obtained from different published studies). In such 
cases, a limited number of equally spaced evaluation points should suffice as long as the item 
parameters are estimated well. 
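
To make the role of equally spaced evaluation points concrete, the following is a minimal sketch of an asymmetric TCC criterion for the GPCM. It is not the implementation used in the simulation. It assumes a Muraki-style parameterization, defines the TCC as the sum of expected item scores, averages squared TCC differences over 50 equally spaced points on a hypothetical range of [-4, 4], and uses made-up anchor item values. With error-free parameters, the criterion is zero at the true linking constants.

```python
import math


def gpcm_probs(theta, a, b, d):
    """Category probabilities for one GPCM item (assumed Muraki-style form)."""
    z = [0.0]
    for dk in d:
        z.append(z[-1] + a * (theta - b + dk))
    top = max(z)
    ez = [math.exp(v - top) for v in z]
    total = sum(ez)
    return [e / total for e in ez]


def tcc(theta, items):
    """Test characteristic curve: sum of expected item scores at theta."""
    return sum(k * p
               for a, b, d in items
               for k, p in enumerate(gpcm_probs(theta, a, b, d)))


def to_group1_metric(items, A, B):
    """Rescale Group 2 item parameters onto the Group 1 metric."""
    return [(a / A, A * b + B, [A * dk for dk in d]) for a, b, d in items]


def tcc_loss(A, B, items_g1, items_g2, points):
    """Mean squared TCC difference on the Group 1 metric (one direction only)."""
    rescaled = to_group1_metric(items_g2, A, B)
    return sum((tcc(t, items_g1) - tcc(t, rescaled)) ** 2
               for t in points) / len(points)


# 50 equally spaced evaluation points on an assumed range.
points = [-4.0 + 8.0 * i / 49 for i in range(50)]

# Hypothetical anchor items on the Group 1 metric: (a, b, [d1, d2]).
items_g1 = [(1.0, -0.5, [0.4, -0.4]),
            (0.8, 0.3, [0.6, -0.6]),
            (1.3, 0.9, [0.2, -0.2])]

# Error-free Group 2 parameters generated from known linking constants.
A_true, B_true = 1.2, 0.5
items_g2 = [(a * A_true, (b - B_true) / A_true, [dk / A_true for dk in d])
            for a, b, d in items_g1]

loss_at_truth = tcc_loss(A_true, B_true, items_g1, items_g2, points)
loss_at_identity = tcc_loss(1.0, 0.0, items_g1, items_g2, points)
print(loss_at_truth, loss_at_identity)
```

In practice the constants would be found by passing such a criterion to a numerical minimizer. Replacing `points` with a set of θ estimates gives the alternative evaluation scheme considered in the study.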

One of the primary contributions of this work was the introduction of alternative characteristic 
curve methods that allow the practitioner to choose between a symmetric or asymmetric solution 
regardless of the type of characteristic curve that is employed to link parameter estimates. The 
simulation showed that use of a symmetric linking procedure generally led to equal or slightly 
better linking accuracy relative to an asymmetric procedure. Again, these differences were so 
slight that they seemed trivial. We find the comparable accuracy of symmetric and asymmetric 
solutions encouraging because we believe that the characteristics of the measurement situation 
should determine the method used to link parameter estimates. It is comforting to know that this 
choice can be made on substantive, rather than methodological, grounds with little, if any, impact 
on linking accuracy. 
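
The distinction between asymmetric and symmetric solutions can be sketched as follows. One common way to make a characteristic curve criterion symmetric is to add the loss computed on the Group 2 metric to the loss computed on the Group 1 metric. The sketch below does this for a TCC criterion. Whether it matches the exact formulation developed in this study cannot be confirmed from this section, so it should be read as an assumption, as should the parameterization, evaluation points, and item values.

```python
import math


def gpcm_probs(theta, a, b, d):
    """Category probabilities for one GPCM item (assumed Muraki-style form)."""
    z = [0.0]
    for dk in d:
        z.append(z[-1] + a * (theta - b + dk))
    top = max(z)
    ez = [math.exp(v - top) for v in z]
    total = sum(ez)
    return [e / total for e in ez]


def tcc(theta, items):
    """Test characteristic curve: sum of expected item scores at theta."""
    return sum(k * p
               for a, b, d in items
               for k, p in enumerate(gpcm_probs(theta, a, b, d)))


def to_metric(items, A, B):
    """Apply theta* = A * theta + B to a set of item parameters."""
    return [(a / A, A * b + B, [A * dk for dk in d]) for a, b, d in items]


def one_way_loss(target, source, A, B, points):
    """Asymmetric criterion: compare curves on the target group's metric."""
    moved = to_metric(source, A, B)
    return sum((tcc(t, target) - tcc(t, moved)) ** 2
               for t in points) / len(points)


def symmetric_loss(A, B, items_g1, items_g2, points):
    """Sum of the losses computed on each group's metric."""
    forward = one_way_loss(items_g1, items_g2, A, B, points)
    # The reverse direction uses the inverse transformation.
    backward = one_way_loss(items_g2, items_g1, 1.0 / A, -B / A, points)
    return forward + backward


points = [-4.0 + 8.0 * i / 49 for i in range(50)]
items_g1 = [(1.0, -0.5, [0.4, -0.4]), (0.8, 0.3, [0.6, -0.6])]
items_g2 = [(1.1, -0.9, [0.3, -0.3]), (1.0, -0.2, [0.5, -0.5])]

A, B = 1.1, 0.3  # trial linking constants, not estimates
loss_12 = symmetric_loss(A, B, items_g1, items_g2, points)
# Swapping the groups and inverting the constants leaves the value unchanged.
loss_21 = symmetric_loss(1.0 / A, -B / A, items_g2, items_g1, points)
print(loss_12, loss_21)
```

The asymmetric criterion lacks this property: `one_way_loss` generally returns different values for the two directions, so the estimated constants can depend on which group is treated as the target.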

As with any simulation study, the particular results that have emerged here require further 
investigation before they can be adequately generalized. Future research should include an 
assessment of the robustness of these linking methods in situations where the data are 
systematically misfit by the GPCM or when the assumptions of the GPCM fail to hold (e.g., minor 
deviations from unidimensionality and/or respondent independence). Perhaps further distinctions 
among these linking strategies can be made under less ideal circumstances than those simulated 
here. Future research should also explore how the number of response categories affects the 
number of anchor items required to produce adequate linking accuracy. 
Table 1. Interpretable ANOVA effects on MSE for A. 

Source    dfnum   dfden   SS      F        p^a     η²_WF 
N         1       396     .102    145.47   .0001   .249 
LS        1       396     .017    24.53    .0001   .042 
N*LS      1       396     .013    18.81    .0001   .032 
A         3       1188    .047    59.41    .0001   .119 
A*N       3       1188    .029    36.84    .0001   .074 
CC        2       792     .001    14.06    .0001   .033 
S*CC      2       792     .000    17.42    .0001   .039 
S*CC*N    2       792     .000    14.80    .0001   .033 

N=calibration sample size, LS=linking scenario, CC=characteristic curve method, S=symmetry of 
solution, A=number of anchor items. 

^a p-values for within-replication effects are adjusted with the Huynh-Feldt procedure. 






Table 2. Interpretable ANOVA effects on MSE for B. 

Source    dfnum   dfden   SS      F        p^a     η²_WF 
N         1       396     .027    119.09   .0001   .229 
A         3       1188    .013    65.22    .0001   .133 
A*N       3       1188    .006    28.79    .0001   .059 
CC        2       792     .000    13.79    .0001   .033 
CC*EP     4       1584    .000    16.72    .0001   .039 

N=calibration sample size, A=number of anchor items, CC=characteristic curve method, 
EP=nature of evaluation points. 

^a p-values for within-replication effects are adjusted with the Huynh-Feldt procedure. 
Table 3. Interpretable ANOVA effects on MSD for item parameter estimates. 

Source    Parameter   dfnum   dfden   SS      F        p^a     η²_WF 
N         b           1       396     1.413   640.16   .0001   .613 
A         b           3       1188    .038    88.96    .0001   .167 
A*N       b           3       1188    .020    46.29    .0001   .087 
A*EP      b           6       2376    .000    11.92    .0002   .531 
CC*S      b           2       792     .000    23.49    .0001   .054 
CC*S*EP   b           4       1584    .000    24.96    .0001   .056 
N         a           1       396     2.229   877.88   .0001   .670 
A         a           3       1188    .046    67.79    .0001   .132 
A*N       a           3       1188    .032    46.86    .0001   .092 
CC*S      a           2       792     .000    30.56    .0001   .068 
CC*S*N    a           2       792     .000    24.45    .0001   .054 
N         d           1       396     7.582   426.79   .0001   .518 
A         d           3       1188    .042    33.51    .0001   .01 A 

N=calibration sample size, A=number of anchor items, CC=characteristic curve method, 
EP=nature of evaluation points, S=symmetry of solution. 

^a p-values for within-replication effects are adjusted with the Huynh-Feldt procedure. 
Table 4. Interpretable ANOVA effects on MSD for θ. 

Source    dfnum   dfden   SS      F        p^a     η²_WF 
LS        1       396     1.897   147.06   .0001   .265 
A         3       1188    .101    90.31    .0001   .165 
A*N       3       1188    .056    49.92    .0001   .091 
CC        2       792     .001    19.04    .0001   .044 

LS=linking scenario, CC=characteristic curve method, A=number of anchor items, N=calibration 
sample size. 

^a p-values for within-replication effects are adjusted with the Huynh-Feldt procedure. 
Figure Captions 

Figure 1. Two-way interactions for the MSE of A. The top panel illustrates the calibration 
sample size by number of anchor items interaction. The bottom panel illustrates the calibration 
sample size by linking scenario interaction. 

Figure 2. Three-way interaction of characteristic curve by symmetry of solution by calibration 
sample size on the MSE of A. 

Figure 3. Two-way interactions for the MSE of B. The top panel illustrates the calibration 
sample size by number of anchor items interaction. The bottom panel depicts the characteristic 
curve by nature of evaluation points interaction. 

Figure 4. Two-way interactions for the MSD of b. The top panel illustrates the calibration 
sample size by number of anchor items interaction. The bottom panel depicts the nature of 
evaluation points by number of anchor items interaction. 

Figure 5. Three-way interaction of characteristic curve by symmetry of the solution by nature of 
evaluation points on the MSD of b. 

Figure 6. Two-way interaction of calibration sample size by number of anchor items on the MSD 
of a. 

Figure 7. Three-way interaction of characteristic curve by symmetry of the solution by calibration 
sample size on the MSD of a. 

Figure 8. Main effects for the MSD of d. The top panel illustrates the main effect of calibration 
sample size. The bottom panel portrays the main effect of the number of anchor items. 

Figure 9. Interpretable effects for the MSD of θ. The top panel illustrates the main effect of 
characteristic curve. The bottom panel depicts the calibration sample size by number of anchor 
items interaction. 